Twilight zone of protein sequence alignments



Burkhard Rost

Sequence alignments unambiguously distinguish between protein pairs of similar and non-similar structure when the pairwise sequence identity is high (>40% for long alignments). The signal gets blurred in the twilight zone of 20-35% sequence identity. Here, more than a million sequence alignments were analysed between protein pairs of known structures to re-define a line distinguishing between true and false positives for low levels of similarity. Four results stood out. (i) The transition from the safe zone of sequence alignment into the twilight zone is described by an explosion of false negatives. More than 95% of all pairs detected in the twilight zone had different structures. More precisely, above a cut-off roughly corresponding to 30% sequence identity, 90% of the pairs were homologous; below 25% less than 10% were. (ii) Whether or not sequence homology implied structural identity depended crucially on the alignment length. For example, if 10 residues were similar in an alignment of length 16 (>60%), structural similarity could not be inferred. (iii) The 'more similar than identical' rule (discarding all pairs for which percentage similarity was lower than percentage identity) reduced false positives significantly. (iv) Using intermediate sequences for finding links between more distant families was almost as successful: pairs were predicted to be homologous when the respective sequence families had proteins in common. All findings are applicable to automatic database searches.

Keywords: alignment quality analysis/evolutionary conservation/genome analysis/protein sequence alignment/sequence space



Introduction 

Protein sequence alignments in twilight zone
Protein sequences fold into unique three-dimensional (3D) structures. However, proteins with similar sequences adopt similar structures (Zuckerkandl and Pauling, 1965; Doolittle, 1981; Doolittle, 1986; Chothia and Lesk, 1986). Indeed, most protein pairs with more than 30 out of 100 identical residues were found to be structurally similar (Sander and Schneider, 1991). This high robustness of structures with respect to residue exchanges explains partly the robustness of organisms with respect to [...] errors, and it allows for the variety in evolution (Zuckerkandl and Pauling, 1965; Zuckerkandl, 1976; Doolittle, 1979, 1986). Structure alignments have uncovered homologous protein pairs with less than 10% pairwise sequence identity (Valencia et al., 1991; Holmes [...], 1996; Hubbard et al., 1997). Indeed, most similar protein structure pairs appear to have less than 12% pairwise sequence identity ([...], 1997). [...] between all pairs of similar structures is [...] 8-10%. [...] another region, the midnight zone (Doolittle, 1994; [...], 1997). Threading techniques aim at detecting homology pairs from the midnight zone ([...], 1993; Bryant and Altschul, 1995; Rost and Sander, 1996; Sippl and Flöckner, 1996; Fischer et al., 1996; Rost and O'Donoghue, 199[.]). However, traditional sequence alignment methods become problematic at much higher values of sequence identity. Methods often fail to correctly align protein pairs with 20-30% pairwise sequence identity. Hence, Doolittle (1986) coined the term twilight zone for sequence alignments in this region. Do the difficulties of alignment methods in this zone reflect merely technical difficulties (statistical significance of detection), or is the twilight zone denoting a particular feature of evolution?
Length-dependent cut-off for significant sequence identity
Pairwise sequence identity (percentage of residues identical between two proteins) is not sufficient to define the twilight zone. Instead, [...] the relatively [...] number of structure pairs available in 1990, Sander and Schneider (1991) defined a length-dependent threshold for significant sequence identity. The threshold curve defined (dubbed HSSP-curve) was proportional to the inverse square-root of the length for [...] between 7 and 80 residues, and was clipped to saturate at 25% sequence identity over more than 80 residues. [...] more than 30 identical residues [out of 100] adopted different structures (Sander and Schneider, 1991). [...] for the [...] larger PDB (Bernstein et al., 1977) of 1997?

Hopping in sequence space

If we could plot the space of protein sequences, would we observe the protein families as islands? Unfortunately, we cannot tell. Nevertheless, [...] information [...] ([...], 1995; [...] and Crippen, 1995) [...] space. [...] protein families are widened by exploiting the transitivity of homology (Pearson, 1996): (i) a query sequence U is aligned to a database, say SWISS-PROT (Bairoch and [...]); (ii) sequences aligned at levels of significant similarity are used as new seeds U_i, and for each U_i SWISS-PROT is searched again; (iii) this procedure is repeated until no new [...]. Sequence-space-hopping may be used in combination with knowledge from structure [...] ([...], 1997), or to improve the information contained in multiple sequence alignments input to prediction methods ([...], 1996, 1997). Recently, the transitivity of sequence families has been exploited successfully to automatically increase the yield in database searches [[...] Abagyan [...] presented the 'intermediate sequence search' method and Neuwald et al. (1997) introduced [...] ([...], 1997)]. Here, I [...].

I present results of aligning a set of 792 sequence-unique (i.e. less than 25% sequence identity) proteins of known structure against PDB. The following questions were investigated. Is the number of protein pairs of [...] structures proportional to the distance from the HSSP-curve (eqn 1), or does it [...] increase in the twilight zone? Is the curve defined by Sander and Schneider (1991) still [...]? Would using sequence similarity [...] ([...] and Sander)? Finally, can the accuracy be improved for pair alignments by expert rules? The results verified, partially, earlier work based on a 1000-fold larger data set (Sander and Schneider, 1991). The novel aspects were (i) a definition of a threshold for similarity (eqn 2), and a refinement of the threshold for identity; (ii) an introduction of various expert rules. Aspects largely complementing other analyses ([...]; Park et al., 1997; Brenner et al., 1998) were: (i) a large-scale evaluation of exploiting intermediate sequences (sequence-space-hopping); (ii) a detailed analysis of true and false positives providing estimates for accuracy and coverage of database searches; and (iii) a comparison with BLAST, one of the most popular methods for rapid database searches (Altschul et al., 1990; Altschul and Gish, 1996).



Methods

Data set: 792 sequence-unique protein structures
Protein databases are biased towards particular protein families. To reduce this bias, analyses are usually restricted to representative data sets (Hobohm et al., 1992). Here, I chose the maximal set of sequence-unique proteins of known structure available in early 1997 (Holm and Sander, 1996). 'Sequence-unique' was defined as 'no pair in the set falls above the HSSP-curve' (eqn 1; Sander and Schneider, 1991). As a rule-of-thumb, no pair had more than 25% pairwise sequence identity. Each of these proteins was aligned against the subset of PDB contained in the early 1997 release of the FSSP database of protein structure alignments (Holm and Sander, 1996). This subset amounted in total to about 5646 protein chains. Obviously, the second step (792 versus 5646) re-introduced bias into the results. However, aligning the 792 sequence-unique proteins against themselves would not have yielded any result for most of the twilight zone analysed here. Thus, 792 versus 5646 was the best compromise in reducing bias and [...] the biased region. The resulting test set was the largest possible set of proteins for which structural information was available (and thus false and correct hits could be automatically distinguished).
Generation of sequence alignments

Protein pairs were aligned by two different program types. (i) Full dynamic programming as implemented in the Smith-Waterman (Smith and Waterman, 1981) based method MaxHom (Schneider, 1994) (McLachlan metric, with minimum = -0.5, maximum = 1.00, gap open = 3[...], and gap elongation = 0.3); and (ii) quick database searches as implemented by the two versions of the BLAST series: BLASTP [...] and PSI-BLAST [...]. Owing to limitations (CPU time) [...], the analysis was restricted to the [...] 2000 hits for each of the 792 test proteins. (Note: this restriction applied only to the final displayed alignment. [...]) The resulting [...] set consisted of about 1.7 million pairwise alignments. For the comparison [...], the data set had to be reduced to all pairs that were [...] (neither BLASTP nor PSI-BLAST could be forced to report absolutely wrong, i.e. ALL pairwise alignments).
Definition of sequence identity and sequence similarity
(i) Pairwise sequence identity was defined by the percentage of residues identical between two aligned sequences (e.g. aspartic matching aspartic counts 1: D - D = 1; aspartic on glutamic was a non-match: D - E = 0). (ii) Pairwise sequence similarity was defined by the percentage of residues similar between two sequences (e.g. D - D = 1, and aspartic on glutamic was now considered a match: D - E > 0). Similarity scores depend on the particular metric used to capture physico-chemical properties of amino acids (note: most amino acids are not considered 100% similar to themselves by typical matrices, as such matrices are based on log-odds, e.g. for the McLachlan metric only F, W, Y and C yield 100% self-similarity). Consequently, levels of similarity are not directly comparable between different matrices. For comparability I used the McLachlan metric (Gribskov et al., 1987) also used for the HSSP database (Schneider et al., 1997). In principle, there are two ways to convert similarity into percentage values: (i) by normalizing the similarity score by the maximal possible score observed in a given metric (percentage residue similarity), and (ii) by setting an arbitrary threshold of the similarity score to distinguish similar/not similar and counting the percentage of residues that are similar according to this threshold (percentage of similar residues). Again, I followed the practice of the HSSP database compiling the percentage residue similarity (normalized by maximal possible scores). When compiling percentages, the number of identical residues was normalized by the number of residues aligned; gaps were ignored.
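As an illustration of the normalization just described, the following Python sketch computes percentage identity and percentage residue similarity over aligned (non-gap) positions. The function names, the gap character and the scoring function are assumptions made for this example. In particular, toy_score is not the McLachlan metric; it only mimics the property that most residues score below 100% against themselves.

```python
def _aligned_pairs(seq_a, seq_b, gap="-"):
    """Return residue pairs at positions where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]


def percent_identity(seq_a, seq_b, gap="-"):
    """Identical residues as a percentage of the residues aligned (gaps ignored)."""
    pairs = _aligned_pairs(seq_a, seq_b, gap)
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def percent_similarity(seq_a, seq_b, score, max_score=1.0, gap="-"):
    """Summed pair scores, normalized by the maximal possible score per position."""
    pairs = _aligned_pairs(seq_a, seq_b, gap)
    if not pairs:
        return 0.0
    total = sum(score(a, b) for a, b in pairs)
    return 100.0 * total / (max_score * len(pairs))


def toy_score(a, b):
    """Hypothetical metric: only F, W, Y and C are fully self-similar."""
    if a == b:
        return 1.0 if a in "FWYC" else 0.8
    if {a, b} == {"D", "E"}:
        return 0.5  # aspartic on glutamic counts as a partial match
    return 0.0
```

With this toy metric, two identical copies of the fragment AD are 100% identical but only 80% similar, which shows how percentage similarity can fall below percentage identity.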
Standard of truth for structural similarity
Similarity between two protein structures is not uniquely defined. Different structure alignment methods yield different results ([...], 1992; [...] et al., 1993; Luo et al., 1993; Orengo, 1994; Crippen and Maiorov, 1995; Gerstein and Levitt, 1996; Holm and Sander, 1996; Orengo and Taylor, [...]). Such differences can be [...] illustrated by differences between the expert-based database of structural alignments SCOP (Murzin et al., 1995; Brenner et al., 199[.]; Hubbard et al., 1997), and the automatically generated databases CATH (Orengo et al., 1993, 1997) and FSSP (Holm and Sander, [...]). [...] However, this is only a trend. For many examples, [...] structural similarity and FSSP does not. Here, I chose the FSSP database as standard of truth: any pair for which FSSP listed a significant score [zDALI > 4 (Holm and Sander, 1996)] of structural similarity was considered to be structurally similar. In order to distinguish between true and false positives this decision implied that all pairs not listed at the given cut-off of the FSSP database were structurally not similar. However, this brought up the problem of different structure alignment methods. For example, SCOP may consider a pair structurally similar, and FSSP may not. Thus, additionally all pairs were excluded from the analysis that were listed in FSSP but with lower z-scores. Even that still left pairs of proteins with clear levels of sequence identity (more than 40%) which were not found listed in FSSP. Thus, I had to refine this procedure by semi-automatically checking the structural similarity for about 2000 protein pairs all of which had levels of above 50% pairwise sequence identity [note: this number was negligibly small, as only 1% of all pairs were found above this value (Fig. 3B)]. The particular way in which the standard-of-truth was constructed implied that estimates for true positives might be slightly optimistic, estimates for false negatives slightly pessimistic.

Fig. 1. Sketch of sequence-space-hopping. The triangle defines three search proteins (A, B and C) having mutually less than 25% sequence identity. The circles define the three families (all sequences inside the circle [...] have more than 25% sequence identity to the respective search proteins A, B and C). Sequence-space-hopping implies joining the circles representing the protein families (as shown for proteins A and B in the striped circles) if they contain identical proteins (that are aligned in the same region [...] in the example given).

Concept of true and false hits 

When Chothia and Lesk (1986) first analysed the relation between sequence and structure similarity, they monitored the details of structural differences, and found that the differences are inversely proportional to the level of sequence identity. The binary notion of 'similar structure' (true or false) used in this analysis reflected a different focus: the goal was to estimate the accuracy in correctly detecting rather than in correctly aligning homologues. Did this imply that correct detection and correct alignment were not correlated (as often the case for threading; Bryant and Altschul, 1995; Lemer et al., 1995; Sippl, 1995; Fischer et al., 1996)? Not necessarily, but the fact is that two homologues can be detected although part, or even the entire, alignment is wrong. (However, this extremely interesting point was not pursued further in this analysis.)

Fig. 2. Explosion of structurally dissimilar pairs in the twilight zone. Numbers of true (pairs with similar structure) and of false positives (pairs with no similar structure) plotted versus the distance to the HSSP-curve (Sander and Schneider, 1991), i.e. the horizontal axes give the distance from the threshold defined in eqn 1 (numbers refer to the parameter n in eqn 1). The levels of pairwise sequence identity corresponding to the distance were shown on top. (A) Number of pairs observed at any distance (logarithmic scale). (B) Cumulative number of pairs observed (logarithmic scale). For example, at a threshold corresponding to about 32% sequence identity for [...] alignments, the numbers of true and false positives were equal (arrow in A); at about 29% even the cumulative numbers of true and false positives were equal (arrow in B). Note: numbers of false negatives and true negatives result from the cumulative sums left of the threshold; percentages of true and false positives given in Figure 5.

Fig. 3. Pairwise sequence identity versus alignment length. The original HSSP-curve (Sander and Schneider, 1991) (dotted circles, eqn 1) appeared to fit the true positives (homologues, A) better than the false positives (B). In contrast, the new curve proposed here (filled diamonds, eqn 2) was more conservative in excluding positives. Note that due to the huge number of pairs the plots for true (A) and false (B) positives appeared almost equally densely populated (Figure 2 revealed the problem of such a scatter plot).

The following cases were distinguished: (i) true positives, alignments between proteins of similar structure that fall above a given threshold (defined by the sequence alignment method); (ii) false positives, alignments between proteins of dissimilar structure that fall above a given threshold of the sequence alignment; (iii) true negatives, alignments between proteins of dissimilar structure that fall below a given threshold; and (iv) false negatives, alignments between proteins of similar structure that fall below a given threshold. Note that 'negatives' and 'positives' represent two sides of the same coin: at any threshold extracted from the sequence alignment, the following equations hold (for cumulative numbers):

false negatives + true positives = all pairs of similar structure

true negatives + false positives = all pairs of dissimilar structure.
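The four cases and the two sum rules above can be sketched in a few lines of Python. The input layout (a list of score and structural-similarity flags) and the function name are assumptions for this illustration; treating a score exactly at the threshold as 'below' is likewise an arbitrary choice.

```python
from collections import Counter


def classify_pairs(pairs, threshold):
    """Count true/false positives/negatives for one cut-off.

    pairs: iterable of (score, structurally_similar) tuples, where score is
    whatever the alignment method reports (e.g. distance from a curve).
    """
    counts = Counter(TP=0, FP=0, TN=0, FN=0)
    for score, similar in pairs:
        if score > threshold:          # pair 'falls above' the threshold
            counts["TP" if similar else "FP"] += 1
        else:                          # pair 'falls below' the threshold
            counts["FN" if similar else "TN"] += 1
    return counts
```

Whatever threshold is chosen, FN + TP always equals the number of structurally similar pairs and TN + FP the number of dissimilar pairs, so moving the cut-off only shifts pairs between 'positive' and 'negative'.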
Distance to HSSP threshold 

The HSSP-curve was originally defined by (Sander and Schneider, 1991):

p^{I}(n) = n + \begin{cases} 290.15 \cdot L^{-0.562}, & \text{for } L < 80 \\ 25, & \text{for } L \ge 80 \end{cases} \qquad (1)

where L gave the number of residues aligned between two proteins; p^I the cut-off percentage of identical residues over the L aligned residues; and n described the distance in percentage points from the curve (n = 0 corresponds to the original HSSP-curve; n = 5 to the official HSSP database releases; curve plotted in Figure 3). Once Sander and Schneider (1991) had discovered the basic functional dependence between sequence identity and alignment length, they merely had to fix two free parameters: the factor and the exponent. Both were chosen to fit the data observed in 1991, in particular to reach values of 25% around alignment length of 80, and values of 100% around alignment length of 10. The principal functional dependence described by eqn 1 also follows from statistics, as was recently shown in an elegant work (Alexandrov and Soloveyev, 1998). Let p_i (i = 1, ..., 20) be the probability that amino acid i occurs in a protein, and m_ij the score for randomly aligning two amino acids i and j. The score S of an entire alignment can then be approximated by:

S \approx \langle m \rangle \cdot L

where <m> is the expectation value of m_ij and L the alignment length. If the values of m_ij are independent random variables, it follows (after some elementary operations) that the relation between the standard deviation of the values of m_ij, sigma(m), and the resulting score distribution (sigma_S) is:

\sigma_S = \sigma(m) \cdot \sqrt{L}

In their original article Alexandrov and Soloveyev work out an appropriate rescaling of the dynamic programming alignment. However, this scheme cannot be applied after the alignment has been completed (as the threshold functions used in this work); rather it has to be implemented into the alignment method.
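A minimal Python sketch of the HSSP threshold and of the 'distance' n may help to make eqn 1 concrete. It assumes that eqn 1 reads n + 290.15 · L^-0.562 below 80 aligned residues and n + 25 from 80 residues on; the exponent and the exact position of the boundary are read from a damaged equation and should be checked against Sander and Schneider (1991). The function names are hypothetical.

```python
def hssp_curve(length, n=0.0):
    """Cut-off percentage of identical residues for an alignment of `length`.

    Assumed form of eqn 1: a power law for short alignments,
    clipped to 25% for alignments of 80 or more residues.
    """
    if length <= 0:
        raise ValueError("alignment length must be positive")
    base = 290.15 * length ** -0.562 if length < 80 else 25.0
    return n + base


def hssp_distance(percent_identity, length):
    """Distance n (in percentage points) of an alignment from the n = 0 curve."""
    return percent_identity - hssp_curve(length)
```

For an alignment longer than 80 residues at 32% identity, hssp_distance gives 7 percentage points, i.e. the 'HSSP-curve + 7%' level; for an alignment of 10 residues the assumed power law yields a cut-off of roughly 80%.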

New curve for length-dependent significance of pairwise sequence identity

I attempted to solve the problems of the original HSSP-curve (eqn 1; Results) by defining the following curve for the separation of true and false positives (Figure 3, grey line with dotted circles):

where L gave the number of residues aligned between two proteins; p the cut-off percentage of identical residues over the L aligned residues; and n described the distance in percentage points from the curve (n = 0 plotted in Figure 3). The constraints in visually selecting the function were (i) to maintain the functional form defined by eqn 1 (and suggested by the statistics of Alexandrov and Soloveyev, 1998); (ii) to hit the 100% mark at alignments that are too short to reveal anything about structural similarity (about 11 residues); (iii) to saturate at levels around 20% sequence identity (reached for length of about 300); and (iv) to roughly reflect the observed gradient. Saturation for long alignments was realized by the functional form of the exponent (note: the term [...] resulted in an exponential decay). This 'saturation' constraint also affected the particular value of the factor (0.32 rather than about 0.5 as suggested by the distribution of the data, Figure 4).

New curve for length-dependent significance of pairwise sequence similarity

The original HSSP-curve was derived for sequence identity, not for sequence similarity (Sander and Schneider, 1991). The functional dependence between similarity and length appeared comparable to the one between identity and length (Results). This prompted a similar definition for the separation between true and false positives based on similarity:

p^{S}(n) = n + 420 \cdot L^{-0.335 \cdot (1 + e^{-L/2000})} \qquad (3)

where L gave the number of residues aligned between two proteins; p^S defined the cut-off for the percentage of residue similarity over the L aligned residues; and n described the distance in percentage points from the curve (n = 0 plotted in Figure 4).

Fig. 4. Pairwise sequence similarity versus alignment length (horizontal axes: number of residues aligned). (A) Correctly detected structural homologues; (B) false positives. Open circles, original HSSP-curve (Sander and Schneider, 1991) (eqn 1); filled triangles, new curve proposed here (eqn 3).
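The similarity threshold can be sketched in Python as follows. The factor 420 and the exponent 0.335 are legible in eqn 3; the saturating term 1 + exp(-L/decay) and, in particular, the decay constant of 2000 residues are assumptions, chosen here because they reproduce the saturation at about 10% for alignments of about 500 residues reported in Results. The function name is hypothetical.

```python
import math


def similarity_curve(length, n=0.0, decay=2000.0):
    """Cut-off percentage of residue similarity for an alignment of `length`.

    Assumed form: n + 420 * L ** (-0.335 * (1 + exp(-L / decay))).
    The exponent shrinks in magnitude for long alignments, which flattens
    the curve (the 'saturation' constraint described in the text).
    """
    if length <= 0:
        raise ValueError("alignment length must be positive")
    exponent = -0.335 * (1.0 + math.exp(-length / decay))
    return n + 420.0 * length ** exponent
```

Under these assumptions the curve exceeds 100% only for very short alignments and lies between 10% and 11% at 500 aligned residues; a different decay constant would shift both values.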

Sequence-space-hopping

Suppose proteins A_0 and B_0 were less than 25% identical; family A is given by: {A_0, A_1, ..., A_n} (such that all proteins in the family A are more than 25% identical to A_0); analogously family B is given by: {B_0, B_1, ..., B_m}. Although A_0 and B_0 differed by more than 75%, it may well be true that both were aligned to the same sequences, i.e. that for some i and j: A_i = B_j. If this is the case, 'sequence-space-hopping' refers to simply expanding both families A and B to become: {A_0, A_1, ..., A_n, B_0, B_1, ..., B_m} (Figure 1). Technically, I described this situation by compiling a simple matrix H(A,B) that contained the number of overlapping proteins (i.e. those contained both in family A and B) between all proteins in the test set (792 chains) and all proteins in the search set (5646 chains). For example, H(A,B) = 5 implied that test protein A and search protein B had five identical proteins in their family alignments.

The family alignments were taken from the HSSP database (Schneider et al., 1997) with a cut-off at: HSSP-curve + 10% (n = 10 in eqn 1), i.e. for alignments longer than 80 residues, 35% pairwise sequence identity was required. All protein pairs (A,B) in the twilight zone were investigated for which H(A,B) was larger than zero. Note, the concept of sequence-space-hopping explored here is being used in everyday sequence analysis. The novel idea introduced by others (Abagyan and Batalov, 1997; Neuwald et al., 1997; Park et al., 1997) was NOT to use sequence-space-hopping, but to use it for reducing false positives in large-scale sequence analysis. Here, I simply applied this concept to the large data set explored, and investigated its usefulness in dependence on various parameters.
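The overlap matrix H(A,B) and the joining of families can be sketched as follows. The data layout (dictionaries mapping each test or search protein to the set of identifiers in its family alignment) and all identifiers are hypothetical; in the analysis itself the families came from HSSP alignments.

```python
def overlap_matrix(test_families, search_families):
    """Return H as a dict: H[(a, b)] = number of proteins common to both families.

    Pairs without any common protein are omitted (H = 0).
    """
    h = {}
    for a, fam_a in test_families.items():
        for b, fam_b in search_families.items():
            shared = len(set(fam_a) & set(fam_b))
            if shared:
                h[(a, b)] = shared
    return h


def hop(test_families, search_families, min_shared=1):
    """Join the families of every pair sharing at least `min_shared` proteins."""
    h = overlap_matrix(test_families, search_families)
    return {
        (a, b): set(test_families[a]) | set(search_families[b])
        for (a, b), shared in h.items()
        if shared >= min_shared
    }
```

Raising min_shared corresponds to the stricter filters explored later (at least one, five or ten common proteins): fewer pairs are linked, but each link rests on more intermediate sequences.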

More-similar-than-identical rule

A simple rule-of-thumb was explored: accept hits only if the level of sequence similarity was higher than the level of sequence identity. This rule may appear to be nonselective in that similarity would always be larger than identity; however, for the given definition of similarity (using the McLachlan metric), this was not the case.
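As a filter on a hit list, the rule is a one-liner. The record layout below (dictionaries with 'identity' and 'similarity' keys holding percentages) is an assumption made for this sketch.

```python
def more_similar_than_identical(hits):
    """Keep only hits whose percentage similarity exceeds percentage identity."""
    return [hit for hit in hits if hit["similarity"] > hit["identity"]]
```

A hit with 28% identity and 35% similarity would be kept, whereas one with 31% identity and 27% similarity would be discarded.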

Results

Number of false positives exploded in twilight zone
In contrast to 1990, when Sander and Schneider (1991) compiled their data, now protein pairs of dissimilar structure were detected above the 30% cut-off (Figure 2A). And these were not exceptions: at a level of 32% (HSSP-curve + 7%, i.e. n = 7 in eqn 1), the number of false positives already equalled that of homologues. For the original HSSP-curve the number of false positives was 20-fold higher than the number of true pairs. The transition from 20 to 30% sequence identity was highly non-linear for true and false positives (logarithmic scales in Figure 2): the number of true pairs rose by a factor of 5, that of false pairs by a factor of 200 (Figure 2B). Thus, below the region of significant pairwise sequence identity (>34%) the population of false positives exploded. However, the vast majority of homologues also had less than 30% sequence identity.

Functional shape of original HSSP-curve adequate 
The functional shape of the original HSSP-curve proved to be basically correct (Figure 3, grey line with triangles). However, the larger data set analysed here revealed several problems in detail (Figure 3B). (i) A threshold of 25% was not reasonable for an alignment length below 150-200 residues. (ii) Above an alignment length of about 100 residues, the derivative of the curve separating true and false positives should be lower than at lengths below 80. I attempted to solve these problems by defining a new curve for separating true and false positives (eqn 2; Figure 3, grey line with dotted circles). The particular functional form guaranteed an approximate saturation for long alignments. For alignments shorter than 11 residues eqn 2 yielded values above 100%. However, this was acceptable as 100% identity for fragments of 10-11 residues does not imply structural similarity (Cerpa et al., 1996; Minor and Kim, 1996; Muñoz and Serrano, 1996). The new curve saturated around 20% for alignments over more than 250 residues.
Defining a curve for pairwise sequence similarity
Compiling sequence identity neglects the physico-chemical nature of amino acids. Any multiple sequence alignment illustrates that, for example, the feature hydrophobicity is more conserved than is the residue type. For the million protein pairs investigated here, this was reflected in a shift of the scatter plot towards lower percentages (Figure 4). In particular, for longer alignments false positives fell below 15% pairwise sequence similarity. This prompted the introduction of a threshold specifically for sequence similarity (eqn 3 in Methods; Figure 4, grey line with dotted circles). The curve surpassed 100% for alignments shorter than 12 residues and saturated at about 10% for alignments over more than 500 residues.

Better detection of homologues in twilight zone by new curves

The new curves for length-dependent cut-offs in sequence identity (eqn 2) and similarity (eqn 3) resulted in clearly lower false positive rates (higher accuracy) than the original HSSP-curve (Figure 5B and C). This was paid for by a lower number of true positives detected (lower coverage; Figure 5A). At n = 0 (eqn 1-3), the old curve yielded about twofold more true positives, but more than 20-fold more false positives compared to the new curves for identity and similarity. Furthermore, at any level of true positives detected, the number of false positives was smaller for the new curves (eqn 2-3) than for the original HSSP-curve (eqn 1; Figure 7). When applying a cut-off according to mere sequence identity (ignoring alignment length), accuracy dropped below 10% at levels of 30% sequence identity (Figure 5C). Thus, detection accuracy rose almost 10-fold by the new curves.

Improving detection accuracy by expert rule
Experts often apply rules-of-thumb to visually distinguish true and false positives. However, many of such simple rules appeared not valid for automatic implementation. In particular, the distributions of the number and length of insertions did not, on average, differ between false and true positives (data not shown). Detection accuracy improved marginally by applying the following rules: (i) compile the distance for the similarity score n^S (eqn 3), and the identity score n^I (eqn 2), average over both ([n^S + n^I]/2), and accept pairs when this average is above some threshold n; (ii) take pairs whenever either identity or similarity surpassed the respective threshold (either n^I > n or n^S > n); (iii) take pairs if both values were above a given cut-off (n^I > n and n^S > n). In contrast, detection accuracy increased significantly by applying the 'more-similar-than-identical' rule: accept hits found in a database search only if percentage similarity is larger than percentage identity. This constraint resulted in >95% detection accuracy at n = 0 cut-off levels (eqn 2-3), while 2-4-fold less true positives were found at this level (Figure 5A and C). Hence, applied as a conservative cut-off in automatic database searches, this rule proved rather powerful.

Fig. 6. Improving accuracy by sequence-space-hopping. Distances were compiled according to the old curve (eqn 1, 'old'), and to the new curve for identity (eqn 2, 'new'). Corresponding levels of sequence identity shown on top. The cumulative percentages of positives detected at a given cut-off distance were compiled for three different filtering strategies: hits were accepted if at least one (H(A,B) ≥ 1), five (H(A,B) > 4) or 10 (H(A,B) ≥ 10) proteins were common between two protein families (Methods). (A) Cumulative percentage of true positives (false positives = 100 - true); (B) cumulative number of true positives. The comparison of the true positives reached by intermediate sequences and all true positives (grey line in B, note: same as in Figure 2) showed that (i) less than 1/1000 of the true positives were reached by intermediate sequences; (ii) the number of pairs reached by intermediate sequences did not explode in the twilight zone (scale on the left covers two orders of magnitude, that on the right only one). Numbers for true and false negatives would not make sense for this analysis: as we do not know all proteins, we cannot conclude that two families are unrelated only because we cannot find a link between them.
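The three combination rules (i)-(iii) differ only in how the two distances are merged before the comparison with the cut-off. A small Python sketch, with hypothetical function and rule names, and with n_identity and n_similarity standing for the distances from the identity and similarity curves:

```python
def accept(n_identity, n_similarity, cutoff=0.0, rule="average"):
    """Decide whether a pair is accepted under one of the three expert rules."""
    if rule == "average":   # rule (i): mean of both distances above the cut-off
        return (n_identity + n_similarity) / 2.0 > cutoff
    if rule == "either":    # rule (ii): at least one distance above the cut-off
        return n_identity > cutoff or n_similarity > cutoff
    if rule == "both":      # rule (iii): both distances above the cut-off
        return n_identity > cutoff and n_similarity > cutoff
    raise ValueError(f"unknown rule: {rule!r}")
```

For a pair lying 4 points above the identity curve but 2 points below the similarity curve, 'average' and 'either' accept at a cut-off of 0 while 'both' rejects, which illustrates why 'both' is the most conservative of the three.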

Improving detection accuracy by sequence-space-hopping

Hopping in sequence space proved successful in discarding false positives. Already the minimal constraint to accept a pair if at least one protein was common between the two sequence families yielded levels of around 80% accuracy even down to cut-off levels corresponding to 20% sequence identity (Figure 6A, compared with <20% accuracy for the normal thresholds, Figure 5C). Accuracy increased further when more proteins were required to be common to both families (Figure 6A). However, sequence-space-hopping was possible for only relatively few protein pairs (Figure 6B). Furthermore, the improvement in accuracy was less clear using sequence-space-hopping than by applying the 'more-similar-than-identical' rule (Figure 5).



Accuracy versus coverage for BLAST and full dynamic 
programming 

The balance between accuracy (percentage of true pairs) and coverage (percentage of all true pairs) enables choosing automatic thresholds according to a particular purpose of a database search. It also permits comparing different methods (the higher the values, the better). (i) As expected, the commonly used simple level of sequence identity (disregarding alignment length) proved, again, an extremely bad choice. (ii) Surprisingly, the fast database searching method BLAST performed relatively well in comparison to the full dynamic programming (Figure 7A). (iii) Both BLASTP version 2 and PSI-BLAST were almost as good as the full dynamic programming with the previously defined HSSP-threshold (Sander and Schneider, 1991). (iv) Best performance was achieved by the new threshold for similarity (eqn 3). (v) However, the raw alignment score performed almost as well. (vi) BLASTP (Altschul et al., 1990) performed rather similarly to the more elaborate and more recent PSI-BLAST (Altschul et al., 1997) (and for 'high' accuracy even slightly better, Figure 7A inset; note: given that standard parameters were chosen, this was not surprising). The corresponding thresholds were given in Figure 5B for the dynamic programming, and in Figure 7B for the PSI-BLAST probabilities.
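Accuracy and coverage at a given cut-off can be computed as in the following sketch, which reads accuracy as the percentage of detected pairs that are true and coverage as the percentage of all true pairs that are detected. The input layout and the function name are assumptions for illustration.

```python
def accuracy_coverage(pairs, threshold):
    """Return (accuracy, coverage) in percent for one cut-off.

    pairs: list of (score, structurally_similar) tuples.
    """
    tp = sum(1 for score, similar in pairs if score > threshold and similar)
    fp = sum(1 for score, similar in pairs if score > threshold and not similar)
    all_true = sum(1 for _, similar in pairs if similar)
    accuracy = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    coverage = 100.0 * tp / all_true if all_true else 0.0
    return accuracy, coverage
```

Scanning the threshold from strict to permissive and plotting the resulting pairs of values gives an accuracy-versus-coverage curve of the kind used to compare methods; lowering the threshold typically raises coverage at the expense of accuracy.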
Many false negatives at reasonable cutoff values 
The number of false negatives is often of interest, i.e. the number of proteins that belong to a structure family but were not detected above a given cut-off. For the data sets used here, the cumulative percentage of false negatives was extremely high for all reasonable cut-off levels (Figure 5D). The vast majority of all pairs of proteins with similar structure populate the midnight zone below 10% sequence identity (Rost, 1997). Thus, the extremely high false negative rates proved that methods aligning two proteins merely based on the pairwise levels of sequence homology clearly fail to find the gold of database searches (and that older analyses that failed to describe this effect were based on biased data sets).
Thresholds for practical use 

For simplicity the functions (eqn 1-3) were explicitly provided
in tables (Rost, 1998). At levels of n = 0 (eqn 1-3) the
cumulative numbers of true positives were (Figure 5): HSSP-curve
(eqn 1), 12%; new identity curve (eqn 2), 56%; new
similarity curve (eqn 3), 73%. In order to achieve levels of
99% correct hits m percentage points have to be added to the
curves, where m was: HSSP-curve, m ~ 5; new identity curve,
m = 5; new similarity curve, m = 12. For comparison,
applying the 'more-similar-than-identical' rule yielded levels
above 99% down to m = -1.
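How such a shifted threshold and the 'more-similar-than-identical' rule might be applied to a single alignment is sketched below. The curve is passed in as a function, because the actual curves are those of eqn 1-3; the `placeholder_curve` used here is a hypothetical stand-in with an invented shape and must not be read as any of them. All function and parameter names are likewise assumptions for illustration.

```python
from typing import Callable


def passes_curve(percent: float, length: int,
                 curve: Callable[[int], float], offset: float = 0.0) -> bool:
    """True if a percentage (identity or similarity) lies on or above the
    length-dependent threshold curve shifted by `offset` percentage points."""
    return percent >= curve(length) + offset


def more_similar_than_identical(identity: float, similarity: float) -> bool:
    """Keep a pair only if its percentage similarity exceeds its
    percentage identity."""
    return similarity > identity


def placeholder_curve(length: int) -> float:
    """HYPOTHETICAL stand-in, NOT eqn 1-3: falls with alignment length
    and saturates at 20% for long alignments."""
    return max(20.0, 300.0 / length ** 0.5)


# 10 matching residues in an alignment of length 16 (62.5%) stay below
# the placeholder curve, which demands 75% at that length.
print(passes_curve(62.5, 16, placeholder_curve))
# 35% over 100 residues passes at offset 0 but not after adding 12 points.
print(passes_curve(35.0, 100, placeholder_curve))
print(passes_curve(35.0, 100, placeholder_curve, offset=12.0))
# The rule compares the two percentages of one and the same alignment.
print(more_similar_than_identical(identity=30.0, similarity=35.0))
```

Keeping the offset separate from the curve mirrors the way the thresholds are reported here: one fixed curve, plus a number of percentage points chosen according to the accuracy required.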

Conclusions 

Rapid transition from trivial to needle-in-haystack problem
The twilight zone of sequence pair alignments (20-35%
pairwise sequence identity) was characterized by two
non-linear transitions. (i) The number of homologues (true
positives) rose by a factor of about eight (Figure 2A). I
obtained a similar result from analysing the first four entire
genomes (Rost, 1997) which indicated that this result was
general, rather than database dependent. (ii) The number of
false positives rose by a factor of 5000 (Figure 2B). Hence,
separating true and false positives switched from a trivial
task (above 35%) to the problem of finding needles in a
haystack (20-30%).
Fig. 7. Accuracy versus coverage for various methods and thresholds.
Accuracy was defined as the cumulative percentage of true positives (actual
true/all actual), coverage as the percentage of true positives that were
detected at a given threshold (actual true/all true). (A) Thresholds and
methods showed: Δidentity, new threshold for length-dependent sequence
identity (eqn 2); Δsimilarity, new threshold for length-dependent sequence
similarity (eqn 3); HSSP-curve, curve proposed by Sander and Schneider
(1991; eqn 1); %identity, threshold given by sequence identity alone, i.e.
disregarding alignment length; alignment score, score used for the dynamic
programming optimization in MaxHom; blast, BLASTP version 2 (Altschul
and Gish, 1996); psi-blast, BLASTP version 3 (Altschul et al., 1997), run
with standard parameters. The values for the BLAST methods were based
on the probability scores reported by those algorithms. The BLAST methods
did not report all pairwise alignments, thus the data set had to be reduced to
the subset for which aligned pairs were reported by all the methods
(MaxHom, BLASTP2, BLASTP3). Note that whereas the curves for the
BLAST methods, as well as for identity and similarity are likely to hold up
in general, the curve for the alignment score is valid for the particular
implementation of the dynamic programming in MaxHom, for the
particular choice of parameters (Methods). (B) Detail of the relation
between the BLAST probability (here for psi-blast), and the cumulative
number of true/false hits, as well as percentage accuracy and coverage.



The explosion of false positives shed light on the shape of
sequence space. From 100-35% sequence identity, any residue
exchange resulting in a stable structure maintains structure.
From 28-35% sequence identity, most residue exchanges
maintain structure. From 20-28% sequence identity, the
absolute majority of residue exchanges resulting in stable
structures populate different protein families. Is the explosion
caused by features of structure space? If one generates protein
sequences at random (or randomly superposes non-related
proteins), the counts for most of the region above 10%
sequence identity are negligible (Rost, 1997). Thus, although
it is obvious that we expect to find more pairs for lower levels
of sequence identity based on mere statistics, the particular
transition in the twilight zone [...]

This analysis did not provide answers to whether or not the
observed explosion may reflect structural (Chung and Subbiah,
1996) and/or functional constraints.

Poor distinction between true and false positives by
sequence identity alone

Even journals such as Cell, or EMBO provide an ample source
for the following fallacy: 'these two fragments of 16 residues
adopt similar structures as they have more than 10 similar
residues'. Thus, one of the most important messages of this
analysis might be the repetition of a point made by others
(Sander and Schneider, 1991): high levels of sequence
similarity or identity do not guarantee structural similarity
(Figure 5). Instead, the levels of significant sequence identity
and similarity depend on the alignment length (Figures 3 and
4), or the respective raw score of the alignment methods.

Better distinction by new curves for sequence identity and
similarity

The length-dependent cut-off for significant sequence identity
pioneered by Sander and Schneider (1991) needed refinement
in several ways to account for the findings from a 1000-fold
larger data set: (i) shift towards higher values for shorter
alignments; (ii) saturation for alignments longer than 150
residues; (iii) definition of a new curve for levels of sequence
similarity. These tasks were solved by introducing threshold
curves for significant sequence identity (eqn 2), and for
significant sequence similarity (eqn 3). The precise definition
of the two thresholds was entirely empirical. However, the
essential functional dependency of the curves was kept similar
to what would be expected from pure statistical considerations.
Although not true for all problems (Nielsen et al., 1996), on
average, sequence similarity was marginally more successful
than identity in distinguishing true and false positives. The
new curves improved accuracy at a given coverage (Figures 5
and 7). Additionally, this analysis supplied detailed levels for
expected accuracy and coverage for the curves defined, as
well as for standard BLAST searches (Figures 5 and 7).
Such estimates may have implications for automatic database
searches. They also shed light on the comparison between
sequence alignments and threading techniques that both only
make use of pair comparison (i.e. do not use family-specific
profiles): already at levels of 25% sequence identity, pair
alignments detect only 10-30% true positives. This is below
the level of what threading techniques achieve in the interval
0-25% sequence identity (Sippl, 1995; Fischer and Eisenberg,
1996; Russell et al., 1996; Rost et al., 1997).

Improved accuracy by 'more-similar-than-identical' rule and
sequence space hopping

The number of false positives was significantly reduced by
two techniques (only the first of which was novel to this
work). (i) The 'more-similar-than-identical' rule: 95% of all
pairs for which percentage similarity was larger than percentage
identity had similar structures. Thus, this constraint clearly
improved detection accuracy. The cost was low coverage: for
only 10% of the structurally similar pairs the percentage
similarity was larger than percentage identity. This might be
explained by the fact that half of the protein, on average,
embedded in loop regions, may tolerate residue exchanges that
do not conserve physico-chemical properties (and thus decrease
the overall average more than the few to-be-conserved regions
increase it). (ii) The usage of 'multi-links' (Abagyan and
Batalov, 1997), 'intermediate sequences' (Park et al., 1997),
transitivity (Neuwald et al., 1997), or 'sequence space
hopping': most protein pairs that contained a similar subset of
identical proteins in their respective sequence families were
found to have similar structures even at low levels of sequence
homology. Obviously, the validity of transitivity (detection
accuracy) between protein families (Figure 1) depended on
the distance between the families (Figure 6). Interestingly, the
improvement of accuracy hardly depended on the number of
proteins required to be common to two families. This suggested
that although the vast majority of protein pairs with 25%
sequence identity had dissimilar structures, the 'islands'
populated by structure families were well separated.
Unfortunately, for the data set explored here, the yield of this
analysis was found to be very low: on average only one in 1000
pairs was reached via intermediate sequences (Figure 6).
Furthermore, sequence-space-hopping resulted in clearly lower
coverage/accuracy ratios than did the application of the
'more-similar-than-identical' rule (Figures 5 and 6).
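The linking step behind sequence space hopping can be sketched in a few lines, assuming each protein comes with the set of sequences collected into its family by an earlier database search. The function name, the `min_common` parameter and the protein identifiers are hypothetical and serve only to illustrate the idea; how the families are built, and at which distance, is not covered by this sketch.

```python
from typing import Set


def linked_by_hopping(family_a: Set[str], family_b: Set[str],
                      min_common: int = 1) -> bool:
    """Predict two proteins as homologous when the sequence families
    collected around each of them share at least `min_common` proteins."""
    return len(family_a & family_b) >= min_common


# Hypothetical family members, named only for illustration.
family_of_query = {"protA", "protB", "protC"}
family_of_target = {"protC", "protD"}
family_of_other = {"protX", "protY"}

# One shared member links query and target.
print(linked_by_hopping(family_of_query, family_of_target))
# Requiring two shared members removes that link.
print(linked_by_hopping(family_of_query, family_of_target, min_common=2))
# Families without any common member are never linked.
print(linked_by_hopping(family_of_query, family_of_other))
```

Raising `min_common` corresponds to requiring more proteins to be common to both families, which trades the already low yield of the approach against accuracy.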

Beginning of the 90s: over-estimation of sequence alignment 
methods 

Until 1996, very few people had taken up the laborious
task of objective large-scale analyses of protein sequence
comparisons, partly because automatic structure comparison
methods are fairly recent. The few earlier workers (Sander
and Schneider, 1991; Vogt et al., 1995; Gotoh, 1996) based
their work on data sets of about 1000 pairs of protein structure
alignments. Gotoh (1996) and Vogt et al. (1995) used the same
set (Pascarella and Argos, 1992) for testing different alignment
methods, and a variety of substitution matrices. They focused
on monitoring the detailed accuracy in terms of number of
residues correctly aligned. Due to the small data set Vogt et al.
(1995) found about 90% true positives at 30% sequence
identity (ignoring alignment length), and 50% true positives
at 20% sequence identity. For the 1000-fold larger data
set used here the corresponding values were quite different
(ignoring alignment length): 11% true positives at 30%
sequence identity; and 5% true positives at 20% identity.
However, even the more conservative analysis introducing
the importance of alignment length for levels of significant
sequence identity (Sander and Schneider, 1991) still
over-estimated the possible levels of sequence identity between
proteins of dissimilar structure.

End of the 90s: database searches do not reach the
gold mine, yet

The thresholds for sequence identity and similarity defined
here, as well as those established by others (Abagyan and
Batalov, 1997; Brenner et al., 1998) complemented the levels
for 'significance' provided by BLAST (Altschul and Gish,
1996), FASTA (Pearson, 1996) or other statistical analyses
(Bryant and Altschul, 1995) by addressing the question 'how
significant is the significance of the respective alignment
method?'. Based on quite different data sets the principal
messages were similar: (i) most proteins of similar structure
were not found by pairwise sequence comparisons at reasonable
cut-off thresholds; (ii) raw scores from dynamic programming
methods were comparable to the original length-dependent
cut-off thresholds for sequence identity (Sander and Schneider,
1991); (iii) dynamic programming was only slightly superior
to BLAST searches (Altschul and Gish, 1996; Altschul et al.,
1997). However, in detail the numbers differed between the
recent analyses. Obviously, the absolute values depended
crucially on the particular choice of the data set. Abagyan and
Batalov (1997) analysed various substitution matrices on a
data set comparable to the one used in this analysis. They
concluded that raw alignment scores provide better separations
between true and false positives than do length-dependent
cut-offs for sequence identity and similarity. The difference
between their result and the one shown here may result from
the fact that Abagyan and Batalov (1997) used the optimal
choice of all parameters for comparing the raw alignment
score to sequence identity and similarity. Brenner and
co-workers have analysed the accuracy and coverage for various
statistical scores (Brenner et al., 1998). They used a completely
different data set than I did. An approximate comparison of
the two analyses was possible by the reference point of
simple identity (ignoring alignment length). It seems that the
performance for the best separation method they found (new
FASTA) was comparable to the improved, simple thresholds
defined here (eqn 2-3). Here, the BLAST probability was
found to be a relatively good way to separate true and false
positives (Figure 7A): it was only slightly inferior to the raw
dynamic programming alignment score, results for which hold
up exclusively for the particular choice of parameters and the
particular alignment algorithm used.

Thresholds in practice 

The advantage of the length-dependent levels of identity and
similarity (eqn 2-3) over other thresholds (Abagyan and
Batalov, 1997; Alexandrov and Solovyev, 1998) was that
these thresholds, in principle, are applicable to any alignment,
and may relate more explicitly to structure. Identity and
similarity can be compiled easily without having to re-do the
entire database search. In practice, this does not always hold
up: (i) different parameters (e.g. the way in which gaps are
treated) may result in different alignments; and (ii) the
similarity values compiled hold for the choice of a particular
metric (here McLachlan). Additionally, the thresholds introduced
here provide independent evidence for the separation, and
permitted the application of the successful
'more-similar-than-identical' rule.

Will the analysis hold up for the next 500 structures?
The results given here were based on the largest possible data
set for which structural alignments provided a well-defined
distinction between true and false. One conclusion was that
seven years ago (Sander and Schneider, 1991) the database
was too small to capture the details. Will this also be true in
2005? Answers have to remain speculative. (i) Although the
database used in 1990 was 1000-fold smaller than the one
used here, some principal findings were verified. (ii) Assuming
that there are only 1000 folds in nature (Chothia, 1992), and
that these correspond to about 10 000 families, then even the
full catalogue of all protein sequences would yield a data set
essentially only 30 times larger than the one used here (note:
the data set used corresponded to about 300 different folds
aligned against about 1000 families).

Rather more accurate, or more sensitive?
An accurate and sensitive distinction between true and false
positives is important for automatic database searches. The
new curves introduced here (eqn 2-3) proved slightly more
sensitive (higher coverage) and more accurate than the
previously proposed curve (Sander and Schneider, 1991). The
accuracy increased significantly by applying the
'more-similar-than-identical' rule, and by sequence space hopping.
However, accuracy was gained at the expense of coverage. Which
is more important? Clearly, the evolutionary information
contained in multiple alignments is the single most important
contribution to improving protein structure prediction in the
90s (Rost and Sander, 1996; Rost and O'Donoghue, 1997). Is the
gain by increased diversity more important than the loss of
accuracy when using alignments for structure prediction? The
answer depends on the particular prediction goal. For example,
for secondary structure prediction diversity is more important
than accuracy (cut-off at 25% versus that at 30%), whereas for
the prediction of solvent accessibility the opposite is true
(unpublished). Furthermore, as databases grow coverage may
be less important than accuracy. Irrespective of individual
preferences, the sharper the knife cutting between true and
false positives, the better. This analysis has sharpened the
knife a little, and added new optional tools to it.